TAIL BOUNDS FOR STOCHASTIC APPROXIMATION 

MICHAEL P. FRIEDLANDER* AND GABRIEL GOH* 

Abstract. Stochastic-approximation gradient methods are attractive for large-scale convex 
optimization because they offer inexpensive iterations. They are especially popular in data-fitting 
and machine-learning applications where the data arrives in a continuous stream, or it is necessary to 
minimize large sums of functions. It is known that by appropriately decreasing the variance of the 
error at each iteration, the expected rate of convergence matches that of the underlying deterministic 
gradient method. Conditions are given under which this happens with overwhelming probability.

1. Introduction. Stochastic-approximation methods for convex optimization are
prized for their inexpensive iterations and applicability to large-scale problems. The 
convergence analyses for these methods typically rely on expectation-based metrics for 
gauging progress towards a solution. But because the solution path is itself stochastic, 
practitioners — especially those relying on ad-hoc applications of such algorithms for a 
limited number of iterations — may pause and question how far an observed solution 
path is from the optimal value. The aim of this paper is to develop bounds on 
the probability of deviating too far from the solution. This result complements 
expectation-based analysis, and can furnish useful guidance for practitioners. 

Consider the differentiable convex optimization problem

\[ \mathop{\mathrm{minimize}}_{x \in \mathbb{R}^n} \quad f(x), \]

and the approximate gradient descent iteration

\[ x_{k+1} = x_k - \alpha_k (\nabla f(x_k) + e_k), \tag{1.1} \]

where $e_k$ is a random variable. The gradient residual $e_k$ may, for example, account
for the error incurred in the computation of the gradient $\nabla f(x_k)$. Bertsekas and
Tsitsiklis [2] give mild conditions on $e_k$ and $f$ under which $f(x_k) \to \inf f(x)$ in
probability. Friedlander and Schmidt [5] link the convergence rate to $\mathbb{E}\|e_k\|^2$, which
measures the variance in the error. Our goal here is to complement these results
by providing conditions under which $f(x_k) \to \inf f(x)$ linearly with overwhelming
probability; see Theorem 3.5 of section 3.2.
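As an illustration of iteration (1.1), the following minimal sketch runs the noisy gradient step on a diagonal quadratic. The test function $f(x) = \frac12 \sum_i d_i x_i^2$, the Gaussian residual, and the name `noisy_gradient_descent` are assumptions made for this example only; they are not taken from the paper.

```python
import random

def noisy_gradient_descent(d, x0, sigma, num_iters, seed=0):
    """Run x_{k+1} = x_k - (1/L) * (grad f(x_k) + e_k) on f(x) = 0.5 * sum(d_i * x_i^2).

    The quadratic objective and the Gaussian residual e_k ~ N(0, sigma^2 I)
    are illustrative assumptions. Returns the gaps pi_k = f(x_k) - inf f.
    """
    rng = random.Random(seed)
    L = max(d)                      # Lipschitz constant of the gradient
    x = list(x0)
    gaps = []                       # inf f = 0 for this objective
    for _ in range(num_iters + 1):
        gaps.append(0.5 * sum(di * xi * xi for di, xi in zip(d, x)))
        grad = [di * xi for di, xi in zip(d, x)]
        err = [rng.gauss(0.0, sigma) for _ in x]   # gradient residual e_k
        x = [xi - (gi + ei) / L for xi, gi, ei in zip(x, grad, err)]
    return gaps
```

With `sigma=0.0` the residual vanishes and the run reduces to steepest descent with stepsize $1/L$, so the returned gaps should decrease at least as fast as $\rho^k \pi_0$.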

Two applications of this framework are to provide tail bounds for stochastic-approximation
and for incremental-gradient methods for minimizing the function

\[ f(x) := \mathbb{E}\, F(x, Z), \]

where $Z$ is a random variable. In the context of stochastic approximation, at each
iteration a random sample $\{Z_1, \ldots, Z_{m_k}\}$ of size $m_k$ is generated to compute the
search direction

\[ \nabla f(x_k) + e_k = \frac{1}{m_k} \sum_{i=1}^{m_k} \nabla F(x_k, Z_i). \tag{1.2} \]

In the case where $Z$ takes on a finite number of values with uniform probability,
$f$ reduces to the familiar case of sums of functions:

\[ f(x) = \frac{1}{M} \sum_{i=1}^{M} f_i(x). \tag{1.3} \]

In this context, the incremental-gradient method chooses search directions

\[ \nabla f(x_k) + e_k = \frac{1}{m_k} \sum_{i \in S_k} \nabla f_i(x_k), \tag{1.4} \]

where the random sample $S_k \subseteq \{1, \ldots, M\}$ of size $m_k$ is chosen without replacement.
At one extreme is a fixed sample size $m_k$ (equal to 1, say), which yields an
inexpensive iteration but generally does not converge to a minimizer unless $\alpha_k \to 0$;
at best it converges sublinearly to the solution. At the other extreme is steepest
descent, which surely converges linearly. As do Friedlander and Schmidt [5] and Byrd,
Chin, Nocedal, and Wu [3], we consider a method for interpolating between these two
extremes by increasing the sample size at a linear rate. We show that this and related
algorithms converge linearly with overwhelming probability; see section 6.
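The interpolation between a fixed sample size and the full gradient can be sketched as follows. The particular schedule `m0 / beta**k` (capped at $M$) and the helper names are hypothetical choices made for illustration; the text only requires that the sample size grow at a linear rate.

```python
import math
import random

def sample_sizes(M, beta, num_iters, m0=1):
    """Hypothetical schedule: m_k grows like m0 / beta**k, capped at the population size M."""
    return [min(M, math.ceil(m0 / beta ** k)) for k in range(num_iters)]

def draw_batch(M, m_k, rng):
    """Choose S_k, a subset of {0, ..., M-1} of size m_k, without replacement."""
    return rng.sample(range(M), m_k)
```

Early iterations use very small batches, and once the schedule reaches $M$ every iteration is a full steepest-descent step.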

1.1. Assumptions and notation. We make the blanket assumption throughout
that the function $f$ is strongly convex and that its gradient $\nabla f$ is uniformly Lipschitz
continuous, i.e., there exist positive constants $\mu$ and $L$ such that for all $x, y \in \mathbb{R}^n$,

\[ f(y) \ge f(x) + \langle y - x, \nabla f(x) \rangle + \tfrac{\mu}{2} \|y - x\|^2, \tag{1.5a} \]

\[ \|\nabla f(y) - \nabla f(x)\| \le L \|y - x\|. \tag{1.5b} \]

Throughout, we use the notation $\rho := 1 - \mu/L$, which can be interpreted as a normalized
condition number of $f$. Let $\pi_k := f(x_k) - \inf f(x)$ be the distance to the optimal value,
and $R_k := \sum_{i=0}^{k-1} \rho^i$. Let $\mathcal{F}_k = \sigma(e_1, e_2, \ldots, e_k)$ be the $\sigma$-algebra generated by the
error history. When the context is clear, $[z]_i$ denotes the $i$th component of a vector $z$.
Except for the discussion in section 1.2, we make the assumption that $\alpha_k = 1/L$.

1.2. Existing convergence analysis. It is well known that deterministic steepest
descent with a constant stepsize $\alpha_k = 1/L$ converges linearly with a rate constant
that depends on the condition number $\rho$, i.e.,

\[ \pi_k \le \rho^k \pi_0; \tag{1.6} \]

see [9, section 8.6]. Because it is convenient to phrase our convergence results for the
stochastic method in terms of its deviation from the deterministic case, we derive most
results in terms of a tail bound on

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon). \tag{1.7} \]

It is straightforward to recast these results to obtain tail bounds on $\Pr(\pi_k \ge \epsilon)$.

In general, if $\liminf_k \|e_k\| \ne 0$, then we necessarily require $\alpha_k \to 0$ in (1.1) in
order to ensure stationarity of limit points. Solodov [15] describes how bounding
the steplengths away from zero yields limit points $\bar{x}$ that satisfy the approximate
stationarity condition

\[ \|\nabla f(\bar{x})\| = O(\lim_k \alpha_k). \]

Bertsekas and Tsitsiklis [2] describe conditions for convergence of the iteration (1.1)
when the steplengths $\alpha_k$ satisfy the infinite travel and summable conditions

\[ \sum_{k=0}^{\infty} \alpha_k = \infty \quad\text{and}\quad \sum_{k=0}^{\infty} \alpha_k^2 < \infty. \]

For a non-vanishing stepsize, i.e., $\liminf_k \alpha_k > 0$, Luo and Tseng [10] show that for
a decreasing error sequence that satisfies $\|e_k\| = O(\|x_{k+1} - x_k\|)$, the sequence $\pi_k \to 0$
at an asymptotic linear rate. For a constant stepsize, Friedlander and Schmidt [5] give
non-asymptotic rates that directly depend on the rate at which the error goes to zero.

The convergence in probability of the stochastic-approximation method was first
discussed in the classic paper by Robbins [13]. More recently, Nemirovski, Juditsky, Lan,
and Shapiro [12] show that for decreasing steplengths $\alpha_k = O(1/k)$, these methods
achieve a sublinear rate according to $\mathbb{E}[\pi_k] = O(1/k)$; the iteration average has similar
convergence properties, and it converges sublinearly with overwhelming probability.

Bertsekas and Nedic [11] show that incremental-gradient methods for (1.3), with
constant steplength $\alpha_k \equiv \alpha$, converge as

\[ \mathbb{E}\pi_k \le O(\rho^k) + O(\alpha). \]

This expression is telling because the first term on the right-hand side decreases at a
linear rate, and depends on the normalized condition number $\rho$; this term is present for
any deterministic first-order method with constant stepsize. Thus, we can see that with
the strong-convexity assumption and a constant steplength, the incremental-gradient
algorithm has the same convergence characteristics as steepest descent, but with an
additional constant-error term.

2. Gradient descent with error. Our point of departure is the following result,
which relates the progress in the objective value to the norm of the gradient residual.

Lemma 2.1. For algorithm (1.1),

\[ \pi_k - \rho^k \pi_0 \le \frac{1}{2L} \sum_{i=0}^{k-1} \rho^{k-1-i} \|e_i\|^2. \]

Proof. From [5, Lemma 2.1], $\pi_{k+1} \le \rho \pi_k + \frac{1}{2L} \|e_k\|^2$. The required result follows
from applying this inequality recursively from $i = k - 1$ down to $i = 0$. □
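The recursion in the proof can be checked numerically along a single sample path. The sketch below is only an illustration under assumed data: a diagonal quadratic objective, Gaussian residuals, and the hypothetical helper name `lemma_2_1_slack`. It returns the slack $\rho^k \pi_0 + \frac{1}{2L} \sum_{i<k} \rho^{k-1-i} \|e_i\|^2 - \pi_k$, which the lemma predicts is nonnegative for every realization of the errors.

```python
import random

def lemma_2_1_slack(d, x0, sigma, num_iters, seed=0):
    """Slack in the discounted-error bound for f(x) = 0.5 * sum(d_i * x_i^2), stepsize 1/L.

    The objective and the Gaussian residuals are assumptions for this example.
    """
    rng = random.Random(seed)
    mu, L = min(d), max(d)
    rho = 1.0 - mu / L

    def f(x):
        return 0.5 * sum(di * xi * xi for di, xi in zip(d, x))

    x = list(x0)
    pi0 = f(x)
    discounted = 0.0      # sum_{i<k} rho^(k-1-i) * ||e_i||^2, updated recursively
    slack = []
    for k in range(1, num_iters + 1):
        err = [rng.gauss(0.0, sigma) for _ in x]
        x = [xi - (di * xi + ei) / L for xi, di, ei in zip(x, d, err)]
        discounted = rho * discounted + sum(ei * ei for ei in err)
        slack.append(rho ** k * pi0 + discounted / (2.0 * L) - f(x))
    return slack
```

The running sum is updated as $D_k = \rho D_{k-1} + \|e_{k-1}\|^2$, which mirrors the recursive application of the one-step inequality.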

When the errors $\|e_k\|$ are identically zero, the search directions in iteration (1.1)
are simply gradient vectors, and this result reduces to the well-known fact that steepest
descent decreases the objective value linearly with an error constant proportional to
the conditioning of the problem, cf. (1.6). When the gradient residuals are nonzero,
however, the inequality in Lemma 2.1 states that the deviation from the progress that would
have been achieved via steepest descent is bounded by the discounted sum of the errors
made at each iteration.

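Friedlander and Schmidt [5, §2] note that the iteration (1.1) yields a monotonic
decrease in the objective value if the error is small enough. In particular,

\[ \pi_{k+1} < \pi_k \quad\text{if}\quad \|e_k\|^2 < \|\nabla f(x_k)\|^2. \]

The following is a probabilistic generalization of this result. The dependence on the
$\sigma$-algebra $\mathcal{F}_{k-1}$ is effectively a conditioning on the history of the algorithm.

Theorem 2.2 (Supermartingale property).

\[ \mathbb{E}[\pi_{k+1} \mid \mathcal{F}_{k-1}] < \pi_k \quad\text{if}\quad \mathbb{E}[\|e_k\|^2 \mid \mathcal{F}_{k-1}] < \|\nabla f(x_k)\|^2. \]

Proof. It follows from assumption (1.5b) that

\[ f(y) \le f(x) + \langle y - x, \nabla f(x) \rangle + \tfrac{L}{2} \|y - x\|^2. \]

Use $x = x_k$ and $y = x_{k+1}$, as defined in (1.1), and simplify to obtain

\[
\begin{aligned}
f(x_{k+1}) &\le f(x_k) - \tfrac{1}{L} \langle \nabla f(x_k) + e_k, \nabla f(x_k) \rangle + \tfrac{1}{2L} \|\nabla f(x_k) + e_k\|^2 \\
&= f(x_k) - \tfrac{1}{L} \|\nabla f(x_k)\|^2 - \tfrac{1}{L} \langle \nabla f(x_k), e_k \rangle \\
&\qquad + \tfrac{1}{2L} \|\nabla f(x_k)\|^2 + \tfrac{1}{L} \langle \nabla f(x_k), e_k \rangle + \tfrac{1}{2L} \|e_k\|^2 \\
&= f(x_k) + \tfrac{1}{2L} \bigl( \|e_k\|^2 - \|\nabla f(x_k)\|^2 \bigr).
\end{aligned}
\tag{2.1}
\]

Then

\[
\begin{aligned}
\mathbb{E}[\pi_{k+1} \mid \mathcal{F}_{k-1}] &= \mathbb{E}[f(x_{k+1}) \mid \mathcal{F}_{k-1}] - \inf f(x) \\
&\overset{(i)}{\le} \mathbb{E}\bigl[ f(x_k) + \tfrac{1}{2L} ( \|e_k\|^2 - \|\nabla f(x_k)\|^2 ) \bigm| \mathcal{F}_{k-1} \bigr] - \inf f(x) \\
&= f(x_k) + \tfrac{1}{2L} \bigl( \mathbb{E}[\|e_k\|^2 \mid \mathcal{F}_{k-1}] - \|\nabla f(x_k)\|^2 \bigr) - \inf f(x) < \pi_k,
\end{aligned}
\]

where (i) follows from (2.1). □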

Example 2.3 (Gradient descent with independent Gaussian noise, part I). Let
$e_k \sim N(0, \sigma^2 I)$. Because $\|e_k\|^2$ is a sum of $n$ independent squared Gaussians, it follows a
scaled chi-squared distribution with mean $\mathbb{E}\|e_k\|^2 = n\sigma^2$. Therefore,

\[ \mathbb{E}\pi_k - \rho^k \pi_0 \le \frac{1}{2L} \sum_{i=0}^{k-1} \rho^{k-1-i}\, \mathbb{E}\|e_i\|^2 = \frac{n\sigma^2}{2L} \sum_{i=0}^{k-1} \rho^{k-1-i}. \tag{2.2} \]

Take the limit inferior of both sides of (2.2), and note that $\lim_{k\to\infty} \sum_{i=0}^{k-1} \rho^i = 1/(1 - \rho) = L/\mu$, and thus that

\[ \mathbb{E} \liminf_{k\to\infty} \pi_k \le \liminf_{k\to\infty} \mathbb{E}\pi_k \le \frac{n\sigma^2}{2\mu}, \]

where the first inequality follows from the application of Fatou's Lemma. Hence,
even though $\lim_{k\to\infty} \pi_k$ may not exist, we can still bound the lower envelope of the
suboptimality by a quantity that is proportional to the variance of the error term.
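The right-hand side of (2.2) is easy to tabulate. The short sketch below (the function name and the parameter values in the usage note are illustrative assumptions) evaluates it and can be used to see how quickly it approaches the limiting value $n\sigma^2/(2\mu)$.

```python
def expected_gap_excess_bound(k, n, sigma, mu, L):
    """Right-hand side of (2.2): (n sigma^2 / (2L)) * sum_{i<k} rho^(k-1-i)."""
    rho = 1.0 - mu / L
    R_k = sum(rho ** i for i in range(k))   # same sum, re-indexed
    return n * sigma ** 2 * R_k / (2.0 * L)
```

For example, with $\mu = 1$ and $L = 10$ the bound increases monotonically in $k$ and saturates at $n\sigma^2/2$, which is the constant-error floor discussed above.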

An immediate consequence of Lemma 2.1 is a tail bound via Markov's inequality:

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \Pr\Bigl( \frac{1}{2L} \sum_{i=0}^{k-1} \rho^{k-1-i} \|e_i\|^2 \ge \epsilon \Bigr) \le \frac{1}{2L\epsilon} \sum_{i=0}^{k-1} \rho^{k-1-i}\, \mathbb{E}\|e_i\|^2. \]

This inequality is too weak, however, to say anything meaningful about the confidence
in our solution after a finite number of iterations. We are instead interested in Chernoff-type
bounds that are exponentially decreasing in $\epsilon$, and in the parameters that control
the size of the error.

3. Bounds for gradient descent with random error. Our first bound makes
no assumption on the relation of the gradient errors between iterations. Thus, it may
or may not allow for history-dependent errors, and we call this a generic error sequence.
The second bound makes a stronger assumption about the relationship of the errors
between iterations.

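3.1. Generic error sequence. Our first exponential tail bounds are defined in
terms of the moment-generating function of the error norms $\|e_k\|^2$, denoted by

\[ \gamma_k(\theta) := \mathbb{E} \exp(\theta \|e_k\|^2). \]

We make the convention that $\gamma_k(\theta) = +\infty$ for $\theta \notin \operatorname{dom} \gamma_k$.

Theorem 3.1 (Tail bound for generic errors). For algorithm (1.1),

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \inf_{\theta > 0} \Bigl\{ \frac{1}{R_k} \sum_{i=0}^{k-1} \exp(-2\theta L \epsilon / R_k)\, \rho^{k-1-i}\, \gamma_i(\theta) \Bigr\}. \tag{3.1a} \]

If $\gamma_k \equiv \gamma$ for all $k$ (i.e., the error norms $\|e_k\|$ are identically distributed), then
the bound simplifies to

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \inf_{\theta > 0} \bigl\{ \exp(-2\theta L \epsilon / R_k)\, \gamma(\theta) \bigr\}. \tag{3.1b} \]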


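Proof. Note that $\sum_{i=0}^{k-1} \rho^{k-1-i} / R_k = 1$. For $\theta > 0$,

\[
\begin{aligned}
\mathbb{E} \exp\Bigl( \theta \sum_{i=0}^{k-1} \rho^{k-1-i} \|e_i\|^2 \Bigr)
&= \mathbb{E} \exp\Bigl( \sum_{i=0}^{k-1} \frac{\rho^{k-1-i}}{R_k}\, \theta R_k \|e_i\|^2 \Bigr) \\
&\overset{(i)}{\le} \mathbb{E} \sum_{i=0}^{k-1} \frac{\rho^{k-1-i}}{R_k} \exp(\theta R_k \|e_i\|^2) \\
&\overset{(ii)}{=} \frac{1}{R_k} \sum_{i=0}^{k-1} \rho^{k-1-i}\, \gamma_i(\theta R_k),
\end{aligned}
\]

where (i) follows from the convexity of $\exp(\cdot)$, and (ii) follows from the linearity of the
expectation operator and the definition of $\gamma_i$. Together with Markov's inequality, the
above implies that for all $\theta > 0$,

\[
\begin{aligned}
\Pr\Bigl( \sum_{i=0}^{k-1} \rho^{k-1-i} \|e_i\|^2 \ge \epsilon \Bigr)
&= \Pr\Bigl( \exp\Bigl[ \theta \sum_{i=0}^{k-1} \rho^{k-1-i} \|e_i\|^2 \Bigr] \ge \exp(\theta\epsilon) \Bigr) \\
&\le \exp(-\theta\epsilon)\, \mathbb{E} \exp\Bigl( \theta \sum_{i=0}^{k-1} \rho^{k-1-i} \|e_i\|^2 \Bigr) \\
&\le \frac{\exp(-\theta\epsilon)}{R_k} \sum_{i=0}^{k-1} \rho^{k-1-i}\, \gamma_i(\theta R_k).
\end{aligned}
\tag{3.2}
\]

This inequality, together with Lemma 2.1, implies that for all $\theta > 0$,

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \Pr\Bigl( \frac{1}{2L} \sum_{i=0}^{k-1} \rho^{k-1-i} \|e_i\|^2 \ge \epsilon \Bigr) \le \frac{\exp(-2\theta L \epsilon)}{R_k} \sum_{i=0}^{k-1} \rho^{k-1-i}\, \gamma_i(\theta R_k), \]

where we use the elementary fact that $\Pr(X \ge \epsilon) \le \Pr(Y \ge \epsilon)$ if $X \le Y$ almost surely.
Redefine $\theta$ as $\theta R_k$, and take the infimum of the right-hand side over $\theta > 0$, which gives
the required inequality (3.1a). The simplified bound (3.1b) follows directly from the
definition of $R_k$. □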

In the case where the errors are identically distributed, there is an intriguing
connection between the tail bounds described in Theorem 3.1 and the convex conjugate
$[\cdot]^*$ of the cumulant-generating function of that distribution, i.e., $\log \circ \gamma$.

Corollary 3.2 (Tail bound for identically-distributed errors). Suppose that the
error norms $\|e_k\|$ are identically distributed. Then for algorithm (1.1),

\[ \log \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le -[\log \gamma(\cdot)]^*(2L\epsilon / R_k). \]

Proof. Take the log of both sides of (3.1b) to get

\[
\begin{aligned}
\log \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) &\le \log \inf_{\theta > 0} \bigl\{ \exp(-2\theta L \epsilon / R_k)\, \gamma(\theta) \bigr\} \\
&= -\sup_{\theta > 0} \bigl\{ (2L\epsilon / R_k)\, \theta - \log \gamma(\theta) \bigr\},
\end{aligned}
\]

which we recognize as the negative of the conjugate of $\log \gamma(\cdot)$ evaluated at $2L\epsilon / R_k$. □

Note that these bounds are invariant with regard to scaling, in the sense that if
the objective function $f$ is scaled by some $\alpha > 0$, then the bounds hold for $\alpha\epsilon$.

The following example illustrates an application of this tail bound to the case
where the errors follow a simple distribution with a known moment-generating function.

Example 3.3 (Gradient descent with independent Gaussian noise, part II). As
in Example 2.3, let $e_k \sim N(0, \sigma^2 I)$. Then $\|e_k\|^2$ is a scaled chi-squared distribution
with moment-generating function

\[ \gamma_k(\theta) = (1 - 2\sigma^2\theta)^{-n/2}, \qquad \theta \in [0, 1/(2\sigma^2)). \]

Note that

\[ [\log \gamma(\cdot)]^*(\mu) = \frac{\mu - n\sigma^2}{2\sigma^2} + \frac{n}{2} \log(n\sigma^2 / \mu) \quad\text{for } \mu > n\sigma^2. \]

We can then apply Corollary 3.2 to this case to deduce the bound

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \Bigl( \frac{2\exp(1)}{n} \cdot \frac{L\epsilon}{\sigma^2 R_k} \Bigr)^{n/2} \exp\Bigl( -\frac{L\epsilon}{\sigma^2 R_k} \Bigr) \quad\text{for } \epsilon > \frac{n\sigma^2 R_k}{2L}. \]

The bound can be further simplified by introducing an additional perturbation $\delta > 0$
that increases the base of the exponent:

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) = O\Bigl( \exp\Bigl[ -\delta \cdot \frac{L\epsilon}{\sigma^2 R_k} \Bigr] \Bigr) \quad\text{for all } \delta \in [0, 1), \tag{3.3} \]

which highlights the exponential decrease of the bound in terms of $\epsilon$.
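The closed-form bound of Example 3.3 can be cross-checked against a direct minimization of the expression inside (3.1b). The sketch below is illustrative only; the function names and the numerical values used in the usage note are assumptions, and the minimizer $\theta^* = 1/(2\sigma^2) - nR_k/(4L\epsilon)$ is obtained by setting the derivative of the logarithm of the objective to zero.

```python
import math

def chernoff_objective(theta, eps, n, sigma, L, R_k):
    """exp(-2 theta L eps / R_k) * (1 - 2 sigma^2 theta)^(-n/2), for theta in [0, 1/(2 sigma^2))."""
    return math.exp(-2.0 * theta * L * eps / R_k) * (1.0 - 2.0 * sigma ** 2 * theta) ** (-n / 2.0)

def optimal_theta(eps, n, sigma, L, R_k):
    """Stationary point of the log-objective; positive when eps > n sigma^2 R_k / (2L)."""
    return 1.0 / (2.0 * sigma ** 2) - n * R_k / (4.0 * L * eps)

def gaussian_tail_bound(eps, n, sigma, L, R_k):
    """Closed-form value of the infimum, valid for eps > n sigma^2 R_k / (2L)."""
    x = L * eps / (sigma ** 2 * R_k)
    return (2.0 * math.e * x / n) ** (n / 2.0) * math.exp(-x)
```

For instance, with $n = 4$, $\sigma = 1$, $L = 2$, $R_k = 3$, and $\epsilon = 10$ the threshold is $3$ and the minimizer is $\theta^* = 0.35$; evaluating the objective there reproduces the closed form.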
3.2. Unconditionally bounded error sequence. In contrast to the previous
section, we now assume that there exists a deterministic bound on the conditional
expectation $\mathbb{E}[\exp(\theta \|e_k\|^2) \mid \mathcal{F}_{k-1}]$. We say that this bound holds unconditionally
because it holds irrespective of the history of the error sequence.

Assumption 3.4. Assume that $\mathbb{E}[\exp(\theta \|e_k\|^2) \mid \mathcal{F}_{k-1}]$ is finite over $[0, \sigma)$, for
some $\sigma > 0$. Therefore there exists, for each $k$, a deterministic function
$\bar\gamma_k : \mathbb{R}_+ \to \mathbb{R}_+ \cup \{\infty\}$ such that

\[ \bar\gamma_k(0) = 1 \quad\text{and}\quad \mathbb{E}[\exp(\theta \|e_k\|^2) \mid \mathcal{F}_{k-1}] \le \bar\gamma_k(\theta). \]

(Thus the bound is tight at $\theta = 0$.)

The existence of such a function in fact implies a bound on the moment-generating
function of $\|e_k\|^2$. In particular,

\[ \gamma_k(\theta) := \mathbb{E} \exp(\theta \|e_k\|^2) = \mathbb{E}\bigl[ \mathbb{E}[\exp(\theta \|e_k\|^2) \mid \mathcal{F}_{k-1}] \bigr] \le \mathbb{E}\, \bar\gamma_k(\theta) = \bar\gamma_k(\theta). \tag{3.4} \]

The converse, however, is not necessarily true. To see this, consider the case where the
errors $e_1, \ldots, e_{k-1}$ are independent Bernoulli-distributed random variables, and $e_k$ is
a deterministic function of all the previous errors, e.g., $\Pr(e_i = 0) = \Pr(e_i = 1) = 1/2$
for $i = 1, \ldots, k - 1$, and the error on the last iteration is completely determined by
the previous errors:

\[ e_k = \begin{cases} 1 & \text{if } e_1 = e_2 = \cdots = e_{k-1}, \\ 0 & \text{otherwise.} \end{cases} \]

Therefore, $\Pr(e_k = 1) = (1/2)^{k-1}$ and $\Pr(e_k = 0) = 1 - (1/2)^{k-1}$, and the moment-generating
function of $e_k$ is $\gamma_k(\theta) = 1 - 2^{1-k}(1 - \exp(\theta))$. Then,

\[ \mathbb{E}[\exp(\theta e_k) \mid e_1, \ldots, e_{k-1}] = \begin{cases} \exp(\theta) & \text{if } e_1 = e_2 = \cdots = e_{k-1}, \\ 1 & \text{otherwise,} \end{cases} \]

whose tightest deterministic upper bound is $\bar\gamma_k(\theta) = \exp(\theta)$. However, $\bar\gamma_k(\theta) > \gamma_k(\theta)$
for all $\theta > 0$.

The following result is analogous to Theorem 3.1.

Theorem 3.5 (Tail bounds for unconditionally bounded errors). Suppose that
Assumption 3.4 holds. Then for algorithm (1.1),

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \inf_{\theta > 0} \Bigl\{ \exp(-2\theta L \epsilon) \prod_{i=0}^{k-1} \bar\gamma_i(\theta \rho^{k-1-i}) \Bigr\}. \]

Proof. The proof follows the same outline as many martingale-type inequalities
[1, 4]. We obtain the following relationships:

\[
\begin{aligned}
\mathbb{E} \exp\Bigl[ \theta \sum_{i=0}^{k-1} \rho^{k-1-i} \|e_i\|^2 \Bigr]
&\overset{(i)}{=} \mathbb{E}\Bigl[ \mathbb{E}\Bigl[ \exp\Bigl( \theta \sum_{i=0}^{k-1} \rho^{k-1-i} \|e_i\|^2 \Bigr) \Bigm| \mathcal{F}_{k-2} \Bigr] \Bigr] \\
&= \mathbb{E}\Bigl[ \mathbb{E}\Bigl[ \exp\Bigl( \theta \rho^0 \|e_{k-1}\|^2 + \theta \sum_{i=0}^{k-2} \rho^{k-1-i} \|e_i\|^2 \Bigr) \Bigm| \mathcal{F}_{k-2} \Bigr] \Bigr] \\
&\overset{(ii)}{=} \mathbb{E}\Bigl[ \exp\Bigl( \theta \sum_{i=0}^{k-2} \rho^{k-1-i} \|e_i\|^2 \Bigr)\, \mathbb{E}\bigl[ \exp(\theta \|e_{k-1}\|^2) \bigm| \mathcal{F}_{k-2} \bigr] \Bigr] \\
&\overset{(iii)}{\le} \mathbb{E}\Bigl[ \exp\Bigl( \theta \sum_{i=0}^{k-2} \rho^{k-1-i} \|e_i\|^2 \Bigr) \Bigr]\, \bar\gamma_{k-1}(\theta) \\
&\overset{(iv)}{\le} \prod_{i=0}^{k-1} \bar\gamma_i(\theta \rho^{k-1-i}),
\end{aligned}
\]

where (i) follows from the law of total expectation, i.e., $\mathbb{E}_Y[\mathbb{E}[X \mid Y]] = \mathbb{E}[X]$; (ii)
follows from the observation that $\exp(\theta \sum_{i=0}^{k-2} \rho^{k-1-i} \|e_i\|^2)$ is a deterministic
function of $e_0, \ldots, e_{k-2}$, and hence is measurable with respect to $\mathcal{F}_{k-2}$ and can be
factored out of the expectation; (iii) uses Assumption 3.4; and to obtain (iv) we simply
repeat the process recursively.

Thus, we now have a bound on the moment-generating function of the discounted
sum of errors $\sum_{i=0}^{k-1} \rho^{k-1-i} \|e_i\|^2$, and we can continue by using the same approach
used to derive (3.2). The remainder of the proof then follows that of Theorem 3.1,
except that the sums over $i = 0, \ldots, k - 1$ are replaced by products over that same range.
□

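In an application where both $\gamma_k$ and $\bar\gamma_k$ are available, it is not true in general
that either of the bounds obtained in Theorems 3.1 and 3.5 is tighter than the other.
When only a bound $\bar\gamma_k$ that satisfies Assumption 3.4 is available, however (which is
the case in the sampling application of section 6), we could leverage (3.4) and apply
Theorem 3.1 to obtain a valid bound in terms of $\bar\gamma_k$ by simply substituting it for $\gamma_k$.
However, as shown below, in this case it is better to apply Theorem 3.5 because it
yields a uniformly better bound:

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \inf_{\theta > 0} \exp\Bigl( -2\theta L \epsilon + \sum_{i=0}^{k-1} \log \bar\gamma_i(\theta \rho^{k-1-i}) \Bigr), \tag{3.5} \]

while Theorem 3.1 (with $\gamma_k$ replaced by $\bar\gamma_k$) gives us

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \inf_{\theta > 0} \exp\Bigl( -2\theta L \epsilon + \log\Bigl[ \frac{1}{R_k} \sum_{i=0}^{k-1} \rho^{k-1-i}\, \bar\gamma_i(\theta R_k) \Bigr] \Bigr), \tag{3.6} \]

where we rescale $\theta$ by $R_k$. A direct comparison of the two bounds shows that they
differ only by one term:

\[ \log\Bigl[ \frac{1}{R_k} \sum_{i=0}^{k-1} \rho^{k-1-i}\, \bar\gamma_i(\theta R_k) \Bigr] \quad\text{vs.}\quad \sum_{i=0}^{k-1} \log \bar\gamma_i(\theta \rho^{k-1-i}). \]

Because $R_k = \sum_{i=0}^{k-1} \rho^i$, the term in the log on the left is a convex combination of
the functions $\bar\gamma_i$. Therefore,

\[
\begin{aligned}
\log\Bigl[ \frac{1}{R_k} \sum_{i=0}^{k-1} \rho^{k-1-i}\, \bar\gamma_i(\theta R_k) \Bigr]
&\overset{(i)}{\ge} \sum_{i=0}^{k-1} \frac{\rho^{k-1-i}}{R_k} \log \bar\gamma_i(\theta R_k) \\
&\overset{(ii)}{\ge} \sum_{i=0}^{k-1} \log \bar\gamma_i(\theta R_k\, \rho^{k-1-i} / R_k)
= \sum_{i=0}^{k-1} \log \bar\gamma_i(\theta \rho^{k-1-i}),
\end{aligned}
\]

where (i) is an application of Jensen's inequality and the concavity of log, and (ii)
follows from the convexity of the cumulant generating function together with
$\log \bar\gamma_i(0) = 0$. It is then evident that (3.5) implies (3.6).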
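The comparison of the two exponents can be checked numerically for a concrete moment-generating function. The sketch below uses the Gaussian case $\gamma(\theta) = (1 - 2\sigma^2\theta)^{-n/2}$ for every iteration; the helper names and the parameter values in the usage note are assumptions made for illustration, and $\theta$ must satisfy $\theta R_k < 1/(2\sigma^2)$ so that both terms are finite.

```python
import math

def log_mgf(theta, n, sigma):
    """log of (1 - 2 sigma^2 theta)^(-n/2); defined only for theta < 1/(2 sigma^2)."""
    return -0.5 * n * math.log(1.0 - 2.0 * sigma ** 2 * theta)

def exponent_terms(theta, k, rho, n, sigma):
    """Return (sum_form, product_form): the terms that distinguish (3.6) from (3.5)."""
    R_k = sum(rho ** i for i in range(k))
    weights = [rho ** (k - 1 - i) for i in range(k)]
    # Term from (3.6): log of a convex combination of gamma_i(theta * R_k).
    mgf_at_scaled_theta = math.exp(log_mgf(theta * R_k, n, sigma))
    sum_form = math.log(sum(w * mgf_at_scaled_theta for w in weights) / R_k)
    # Term from (3.5): sum of log gamma_i(theta * rho^(k-1-i)).
    product_form = sum(log_mgf(theta * w, n, sigma) for w in weights)
    return sum_form, product_form
```

With $k = 5$, $\rho = 0.8$, $n = 3$, and $\sigma = 1$, any $\theta$ below roughly $0.148$ is admissible, and the product form should never exceed the sum form.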

As with Corollary 3.2, a connection can be made between our bound and the
infimal convolution when $\bar\gamma$ is log-concave:

\[ \log \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le -\Bigl[ \mathop{\Box}_{i=0}^{k-1} \bigl[ \log \bar\gamma_i(\cdot\, \rho^{k-1-i}) \bigr]^* \Bigr](2L\epsilon), \]

where $\Box$ denotes the infimal convolution operator.

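Example 3.6 (Gradient descent with independent Gaussian noise, part III).
As in Example 3.3, let $e_k \sim N(0, \sigma^2 I)$. Because the errors $e_k$ are independent,
$\mathbb{E}[\exp(\theta \|e_k\|^2) \mid \mathcal{F}_{k-1}] = \mathbb{E} \exp(\theta \|e_k\|^2) = \gamma_k(\theta)$, which satisfies Assumption 3.4 with
$\bar\gamma_k(\theta) := \gamma_k(\theta)$. Apply Theorem 3.5 to obtain the bound

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \inf_{\theta > 0} \Bigl\{ \exp(-2\theta L \epsilon) \cdot \prod_{i=0}^{k-1} (1 - 2\sigma^2 \theta \rho^{k-1-i})^{-n/2} \Bigr\}. \tag{3.7} \]

Apply Lemma A.2 to obtain

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \Bigl( \frac{2\exp(1)}{n\alpha} \cdot \frac{L\epsilon}{\sigma^2} \Bigr)^{n\alpha/2} \exp\Bigl( -\frac{L\epsilon}{\sigma^2} \Bigr) \quad\text{for } \epsilon > \frac{n\alpha\sigma^2}{2L}, \]

where $\alpha = 1 - (\log \rho)^{-1}$. We simplify the bound to obtain

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) = O\Bigl( \exp\Bigl[ -\delta \cdot \frac{L\epsilon}{\sigma^2} \Bigr] \Bigr) \quad\text{for all } \delta \in (0, 1); \tag{3.8} \]

cf. (3.3).

As an aside, we note that we can easily accommodate correlated noise, i.e., $e_k \sim
N(0, \Sigma)$ where $\Sigma$ is an $n \times n$ positive definite matrix. The error $\|e_k\|^2$ then has the
distribution of a sum of chi-squared random variables that are weighted according to
the eigenvalues $\sigma_i$ of $\Sigma$ [7]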



|6fc| 



E2 2 



and so the above tail bounds hold with a = 
The bounds obtained in Examples 3.3 and 3.6 illustrate the relative strengths
of Theorems 3.1 and 3.5. Comparing (3.3) and (3.8), we see that the asymptotic
bounds only differ by a factor of $1/R_k$. Hence, for large $\epsilon$, the bound in Example 3.3
is uniformly weaker than the bound in Example 3.6. Note that this holds despite the
relaxation (i.e., Lemma A.2) used to simplify (3.7).

4. From tail bounds to moment-generating bounds. Let $\mathcal{G}$ be a $\sigma$-algebra.
In this section we show that an exponential bound

\[ \Pr(|X_i| \ge \epsilon \mid \mathcal{G}) := \mathbb{E}[\mathbf{1}_{|X_i| \ge \epsilon} \mid \mathcal{G}] \le \exp(-\epsilon^2/\nu) \tag{4.1} \]

on the conditional probability [8, Definition 8.11] on a sequence of univariate random
variables $X_i$ translates into a deterministic bound on the conditional moment-generating
function

\[ \mathbb{E}[\exp(\theta \|X\|^2) \mid \mathcal{G}], \]

where $X = (X_1, X_2, \ldots, X_n)$ is an $n$-vector.

Lemma 4.1 (Bounds on moments). If (4.1) holds for some $\nu > 0$, then

\[ \mathbb{E}[X_i^{2v} \mid \mathcal{G}] \le v!\, \nu^v \quad\text{for all } v = 0, 1, 2, \ldots. \]

Proof. Write $Y := X_i$. Use the substitution $\epsilon^{2v} = \tau$ to obtain

\[ \Pr(Y^{2v} \ge \tau \mid \mathcal{G}) \le \exp(-\tau^{1/v}/\nu). \]

Integrate to get

\[ \mathbb{E}[Y^{2v} \mid \mathcal{G}] = \int_0^\infty \mathbb{E}[\mathbf{1}_{Y^{2v} \ge \tau} \mid \mathcal{G}]\, d\tau \le \int_0^\infty \exp(-\tau^{1/v}/\nu)\, d\tau = \Gamma(1 + v)\, \nu^v = v!\, \nu^v, \]

where the first equality comes from the conditional layer-cake representation of positive
random variables [16]. □
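The value of the last integral can be verified with a change of variables. The following worked step is a sketch that introduces the auxiliary variable $u$ (not used in the original text) and assumes $v \ge 1$:

```latex
\[
u = \tau^{1/v}/\nu
\quad\Longrightarrow\quad
\tau = (\nu u)^v, \qquad d\tau = v\,\nu^v u^{v-1}\,du,
\]
\[
\int_0^\infty \exp(-\tau^{1/v}/\nu)\,d\tau
  = v\,\nu^v \int_0^\infty u^{v-1} e^{-u}\,du
  = v\,\nu^v\,\Gamma(v)
  = \Gamma(1+v)\,\nu^v
  = v!\,\nu^v .
\]
```

For $v = 0$ the claim reduces to $\mathbb{E}[1 \mid \mathcal{G}] \le 1$, which holds trivially.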

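With this result, we can translate the bound (4.1) into a bound on the moment-generating
function of $Y^2$.

Lemma 4.2 (Bound on conditional MGF). If (4.1) holds for some $\nu > 0$, then

\[ \mathbb{E}[\exp(\theta Y^2) \mid \mathcal{G}] \le \frac{1}{1 - \theta\nu} \quad\text{for } \theta \in [0, 1/\nu). \]

Proof. Using the Taylor expansion of $\mathbb{E}[\exp(\theta Y^2) \mid \mathcal{G}]$,

\[ \mathbb{E}[\exp(\theta Y^2) \mid \mathcal{G}] = \mathbb{E}\Bigl[ \sum_{i=0}^{\infty} \frac{\theta^i Y^{2i}}{i!} \Bigm| \mathcal{G} \Bigr] \overset{(i)}{=} \sum_{i=0}^{\infty} \frac{\theta^i}{i!}\, \mathbb{E}[Y^{2i} \mid \mathcal{G}] \overset{(ii)}{\le} \sum_{i=0}^{\infty} \theta^i \nu^i = \frac{1}{1 - \theta\nu}. \]

Equality (i) is obtained via the conditional monotone convergence theorem [17, Theorem 9.7e],
which allows us to exchange limits and conditional expectations; inequality
(ii) is obtained using Lemma 4.1. □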

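We now generalize this last result to the case where $X$ is a random $n$-vector.

Theorem 4.3 (From tail bounds to moment-generating bounds). Let $X$ be a
random $n$-vector for which the tail bound (4.1) holds for each $i$ for some $\nu > 0$. Then

\[ \mathbb{E}[\exp(\theta \|X\|^2) \mid \mathcal{G}] \le \frac{1}{1 - \theta\nu n} \quad\text{for } \theta \in [0, 1/(\nu n)). \]

Proof. From Lemma 4.2,

\[ \mathbb{E}\bigl[ \exp(\theta n [X]_i^2) \bigm| \mathcal{G} \bigr] \le \frac{1}{1 - \theta n \nu}. \tag{4.2} \]

The following inequalities hold:

\[
\begin{aligned}
\mathbb{E}\bigl[ \exp(\theta \|X\|^2) \bigm| \mathcal{G} \bigr]
&= \mathbb{E}\Bigl[ \exp\Bigl( \theta \sum_{i=1}^{n} [X]_i^2 \Bigr) \Bigm| \mathcal{G} \Bigr]
= \mathbb{E}\Bigl[ \exp\Bigl( \frac{1}{n} \sum_{i=1}^{n} \theta n [X]_i^2 \Bigr) \Bigm| \mathcal{G} \Bigr] \\
&\overset{(i)}{\le} \frac{1}{n} \sum_{i=1}^{n} \mathbb{E}\bigl[ \exp(\theta n [X]_i^2) \bigm| \mathcal{G} \bigr]
\overset{(ii)}{\le} \frac{1}{1 - \theta n \nu},
\end{aligned}
\]

where (i) follows from Jensen's inequality and (ii) follows from (4.2). □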

5. Convergence rates for linearly decreasing errors. Section 3 describes
tail bounds for (1.7) in terms of any available bound on the moment-generating
function of the error $e_k$. A goal of this section is to show that an exponential tail
bound on the error translates to an exponential tail bound on (1.7). Thus we consider
the case where the tails on each component of $e_k$ are exponentially decreasing (cf.
Hypothesis 5.1.B below). We also consider two additional conditions on the error
sequence, which illustrate the exponential tail bound's relative strength in the following
hierarchy of assumptions. In section 6 we show how various sampling strategies satisfy
these conditions.

Hypothesis 5.1 (Uniform bounds). Suppose that

\[ U_k \le O(1)\, \beta^k \tag{5.1} \]

for all $k$ and for some $\beta < 1$. Assume that the following hold.

A. [Variance] $\mathbb{E}\|e_k\|^2 \le U_k$;
B. [Exponential Tail] $\Pr(|[e_k]_i| \ge \epsilon \mid \mathcal{F}_{k-1}) \le \exp(-\epsilon^2/U_k)$;
C. [Norm] $\|e_k\|^2 \le U_k$.

These conditions are ordered in increasing strength: if (C) holds, then (B) holds by
Hoeffding's inequality (Theorem A.5), and if (B) holds, then (A) holds because the
exponential bound implies a bound on the second moment, i.e.,

\[ \mathbb{E}\bigl[ [e_k]_i^2 \bigm| \mathcal{F}_{k-1} \bigr] = \int_0^\infty \Pr([e_k]_i^2 \ge \epsilon \mid \mathcal{F}_{k-1})\, d\epsilon \le \int_0^\infty \exp(-\epsilon/U_k)\, d\epsilon < \infty. \]

5.1. Expectation-based and deterministic bounds. Although our main goal 
is to derive tail bounds, it is useful to compare these tail bounds against the expectation- 
based and deterministic bounds derived in Friedlander and Schmidt [5, Theorem 
3.3]. We give here a reformulation of these results, which rely on parts A and C of 
Hypothesis 5.1. 



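Theorem 5.2 (Bound in expectation). Suppose that Hypothesis 5.1.A holds. Then

\[ \mathbb{E}\pi_k - \rho^k \pi_0 = O([\max\{\beta, \rho\} + \zeta]^k) \quad\text{for all } \zeta > 0. \]

If $\rho \ne \beta$, then the bound holds with $\zeta = 0$.

Proof. The assumptions give a bound on $\mathbb{E}\|e_k\|^2$ in terms of $\beta$:

\[ \mathbb{E}\|e_k\|^2 \le U_k \le O(1)\, \beta^k \le 2\tau\beta^k \]

for some $\tau$. For $\beta \le \rho$, it follows from Lemma 2.1 that

\[ \mathbb{E}\pi_k - \rho^k \pi_0 \le \frac{1}{2L} \sum_{i=0}^{k-1} \rho^{k-1-i}\, \mathbb{E}\|e_i\|^2 \le \frac{\tau}{L}\, \rho^{k-1} \sum_{i=0}^{k-1} (\beta/\rho)^i \le \frac{\tau}{L}\, k \rho^{k-1}. \tag{5.2} \]

Similarly, for $\beta > \rho$,

\[ \mathbb{E}\pi_k - \rho^k \pi_0 \le \frac{\tau}{L}\, \beta^{k-1} \sum_{i=0}^{k-1} (\rho/\beta)^i \le \frac{\tau}{L}\, k \beta^{k-1}. \tag{5.3} \]

We summarize these last two bounds in the single expression

\[ \mathbb{E}\pi_k - \rho^k \pi_0 \le \frac{\tau}{L}\, k \max\{\beta, \rho\}^{k-1} = O([\max\{\beta, \rho\} + \zeta]^k) \]

for all $\zeta > 0$.

If $\beta \ne \rho$, then it follows from the second inequality in (5.2) and the first inequality
in (5.3), and the summation formula for geometric series, that

\[ \mathbb{E}\pi_k - \rho^k \pi_0 \le \frac{\tau}{L} \cdot \frac{\max\{\beta, \rho\}^k}{|\beta - \rho|} = O(\max\{\beta, \rho\}^k). \tag{5.4} \]

□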
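The two ways of bounding the discounted sum in this proof can be compared directly. The sketch below is illustrative, with hypothetical function names; it evaluates $\sum_{i=0}^{k-1} \rho^{k-1-i} \beta^i$ together with the bound $k \max\{\beta, \rho\}^{k-1}$ used in (5.2) and (5.3) and the geometric-series bound $\max\{\beta, \rho\}^k / |\beta - \rho|$ used in (5.4).

```python
def discounted_rate_sum(k, beta, rho):
    """sum_{i=0}^{k-1} rho^(k-1-i) * beta^i, the sum that appears in (5.2) and (5.3)."""
    return sum(rho ** (k - 1 - i) * beta ** i for i in range(k))

def polynomial_bound(k, beta, rho):
    """k * max(beta, rho)^(k-1); valid for any beta, rho in (0, 1)."""
    return k * max(beta, rho) ** (k - 1)

def geometric_bound(k, beta, rho):
    """max(beta, rho)^k / |beta - rho|; requires beta != rho."""
    return max(beta, rho) ** k / abs(beta - rho)
```

When $\beta = \rho$ the first bound holds with equality, which is why the extra factor $k$, and hence the perturbation $\zeta > 0$, cannot be removed in that case.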

The following result is identical to Theorem 5.2, except that the bound on $\pi_k$ is deterministic.

Theorem 5.3 (Deterministic bound). Suppose that Hypothesis 5.1.C holds. Then

\[ \pi_k - \rho^k \pi_0 = O([\max\{\beta, \rho\} + \zeta]^k) \quad\text{for all } \zeta > 0. \]

If $\rho \ne \beta$, then the bound holds with $\zeta = 0$.



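5.2. Tail bounds. The next result gives exponential tail bounds in terms of the
iteration $k$ and the deviation $\epsilon$ from the linear rate of deterministic steepest descent.

Theorem 5.4 (Tail bounds). Suppose that Hypothesis 5.1.B holds. Then

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) = O\Bigl( \exp\Bigl[ -\frac{\zeta\epsilon}{\max\{\beta, \rho\}^k} \Bigr] \Bigr) \quad\text{for some } \zeta > 0. \tag{5.5} \]

Proof. From Theorem 4.3 the conditioned moment-generating function of $\|e_k\|^2$ is
bounded:

\[ \mathbb{E}[\exp(\theta \|e_k\|^2) \mid \mathcal{F}_{k-1}] \le \frac{1}{1 - \theta U_k n} \quad\text{for } \theta \in [0, 1/(U_k n)). \]

We can now use Theorem 3.5, where we identify $\bar\gamma_k$ with this bound (and define
$\bar\gamma_k(\theta) = \infty$ outside of the required interval), to obtain the tail bound

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \inf_{\theta \in [0, 1/\varsigma)} \Bigl\{ \frac{\exp(-2\theta L \epsilon)}{\prod_{i=0}^{k-1} (1 - \theta U_i n \rho^{k-1-i})} \Bigr\}, \]

where $\varsigma := \max_i \rho^{k-1-i} U_i n$. By (5.1), there exists some constant $\tau$ independent of $\beta$,
$\rho$, and $L$ such that

\[ n U_k \le 2\tau\beta^k. \tag{5.6} \]

Define $\alpha = \max\{\beta, \rho\}$. Now

\[
\begin{aligned}
\Pr(\pi_k - \rho^k \pi_0 \ge \epsilon)
&\overset{(i)}{\le} \inf_{\theta \in [0, 1/\varsigma)} \Bigl\{ \frac{\exp(-2\theta\epsilon L)}{\prod_{i=0}^{k-1} (1 - 2\theta\tau \beta^i \rho^{k-1-i})} \Bigr\} \\
&\overset{(ii)}{=} \inf_{\theta \in [0, 1/\varsigma)} \Bigl\{ \frac{\exp(-2\theta\epsilon L)}{\prod_{i=0}^{k-1} (1 - 2\theta\tau \alpha^{k-1} \min\{\beta/\rho, \rho/\beta\}^i)} \Bigr\},
\end{aligned}
\tag{5.7}
\]

where inequality (i) is obtained by substituting in (5.6), and equality (ii) follows from
the definition of $\alpha$.

Define $\gamma = 1 + 1/\log(1/\min\{\beta/\rho, \rho/\beta\}) = 1 + 1/|\log\beta - \log\rho|$, and apply
Lemma A.4 to (5.7) to obtain

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \Bigl( \frac{\exp(1)}{\gamma} \cdot \frac{L\epsilon}{\tau\alpha^{k-1}} \Bigr)^{\gamma} \exp\Bigl( -\frac{L\epsilon}{\tau\alpha^{k-1}} \Bigr), \qquad \epsilon \ge \gamma\alpha^{k-1}\tau/L. \tag{5.8} \]

Next, note that $\min\{\beta/\rho, \rho/\beta\} \le 1$, and so from (5.7),

\[
\begin{aligned}
\Pr(\pi_k - \rho^k \pi_0 \ge \epsilon)
&\le \inf_{\theta \in [0, 1/\varsigma)} \Bigl\{ \frac{\exp(-2\theta\epsilon L)}{(1 - 2\theta\tau\alpha^{k-1})^k} \Bigr\} \\
&\overset{(i)}{\le} \Bigl( \frac{\exp(1)}{k} \cdot \frac{L\epsilon}{\tau\alpha^{k-1}} \Bigr)^{k} \exp\Bigl( -\frac{L\epsilon}{\tau\alpha^{k-1}} \Bigr), \qquad \epsilon \ge k\alpha^{k-1}\tau/L,
\end{aligned}
\tag{5.9}
\]

where inequality (i) follows from Lemma A.4. Inequalities (5.8) and (5.9) can be
expressed together as

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \Bigl( \frac{\exp(1)}{\gamma_k} \cdot \frac{L\epsilon}{\tau\alpha^{k-1}} \Bigr)^{\gamma_k} \exp\Bigl( -\frac{L\epsilon}{\tau\alpha^{k-1}} \Bigr), \qquad \epsilon \ge \gamma_k\alpha^{k-1}\tau/L, \tag{5.10} \]

where $\gamma_k = \min\{1 + 1/|\log\beta - \log\rho|,\, k\}$.

As $\epsilon \to \infty$,

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le O\Bigl( \exp\Bigl[ -\delta \cdot \frac{L\epsilon}{\alpha^{k-1}} \Bigr] \Bigr) \]

for some positive $\delta$ independent of $L$ and $\alpha$; a bound of the same form holds as $k \to \infty$.
To see this, take the logarithm of both sides of (5.10):

\[ \log \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le \gamma_k \log\Bigl( \frac{\exp(1)}{\gamma_k} \cdot \frac{L\epsilon}{\tau\alpha^{k-1}} \Bigr) - \frac{L\epsilon}{\tau\alpha^{k-1}}, \]

in which the second term dominates the first as $L\epsilon/(\tau\alpha^{k-1})$ grows. This
implies the claimed bound, as required. □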
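To get a feel for the nonasymptotic bound assembled in this proof, it can be evaluated directly. The sketch below assumes the form $(\exp(1)\, x / \gamma_k)^{\gamma_k} \exp(-x)$ with $x = L\epsilon / (\tau \alpha^{k-1})$, valid for $x \ge \gamma_k$; the function name, the choice to return the trivial bound 1 below the threshold, and the parameter values in the usage note are assumptions made for illustration.

```python
import math

def gamma_k(k, beta, rho):
    """min{1 + 1/|log beta - log rho|, k}; equals k when beta == rho."""
    if beta == rho:
        return float(k)
    return min(1.0 + 1.0 / abs(math.log(beta) - math.log(rho)), float(k))

def tail_bound(eps, k, beta, rho, L, tau):
    """Evaluate (e * x / gamma_k)^gamma_k * exp(-x) with x = L * eps / (tau * alpha^(k-1))."""
    alpha = max(beta, rho)
    g = gamma_k(k, beta, rho)
    x = L * eps / (tau * alpha ** (k - 1))
    if x < g:
        return 1.0    # below the threshold the formula does not apply; use the trivial bound
    return (math.e * x / g) ** g * math.exp(-x)
```

At the threshold $x = \gamma_k$ the expression equals 1, and it decreases monotonically beyond it; for fixed $\epsilon$ it also collapses rapidly as $k$ grows because $x$ grows like $\alpha^{-k}$.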



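Corollary 5.5 (Overwhelming tail bounds). Suppose that (5.5) holds. For $k$ fixed, for
all $A > 0$ there exists $C_A$ such that

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le C_A\, \epsilon^{-A}. \]

Furthermore, for $\epsilon$ fixed, for all $A > 0$ there exists $C_A$ such that

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) \le C_A\, A^{-k}. \]

Proof. The right-hand side of (5.10) can be equivalently expressed in two ways as

\[
\begin{aligned}
\Bigl( \frac{\exp(1)}{\gamma_k} \cdot \frac{L\epsilon}{\tau\alpha^{k-1}} \Bigr)^{\gamma_k} \exp\Bigl( -\frac{L\epsilon}{\tau\alpha^{k-1}} \Bigr)
&= O(1) \cdot \epsilon^{\gamma} \exp(-\epsilon \cdot O(1)) \\
&= \exp(f(k)) \exp(-\exp(g(k))),
\end{aligned}
\]

where $f(k) = [\gamma \log(L\epsilon\alpha/(\gamma\tau)) + \gamma] - k\gamma \log\alpha$ and $g(k) = \log(L\epsilon\alpha/\tau) - k \log\alpha$. The result
then follows from Lemma A.1. □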

6. Stochastic and sample average approximations. The results of section 5
are agnostic to the source of the gradient errors that are made at each iteration. We
translate these generic results into sampling policies that yield a linear convergence
rate, both in expectation and with overwhelming probability.

Theorem 6.1 (Stochastic-approximation convergence rates). Consider the
stochastic-approximation algorithm described by (1.1) and (1.2) where

\[ \frac{1}{m_k} \le O(1)\, \beta^k \]

for all $k$ for some $\beta < 1$. Then the following hold.

1. [Expectation bound] If the variance of the error is bounded, i.e.,

\[ \sup_x \mathbb{E}\|\nabla f(x) - \nabla F(x, Z)\|^2 < \infty, \]

then

\[ \mathbb{E}\pi_k - \rho^k \pi_0 = O([\max\{\beta, \rho\} + \zeta]^k) \quad\text{for all } \zeta > 0. \]

If $\rho \ne \beta$, then the bound holds with $\zeta = 0$.

2. [Tail bound] If the diameter of the error is bounded, i.e.,

\[ \sup_x \Bigl\{ \sup_z\, [\nabla F(x, z)]_i - \inf_z\, [\nabla F(x, z)]_i \Bigr\} < \infty \]

for all $i = 1, \ldots, n$, then

\[ \Pr(\pi_k - \rho^k \pi_0 \ge \epsilon) = O\Bigl( \exp\Bigl[ -\frac{\zeta\epsilon}{\max\{\beta, \rho\}^k} \Bigr] \Bigr) \quad\text{for some } \zeta > 0. \]

Proof.
Part 1 (Expectation Bound). Because the random variables $Z_1, \ldots, Z_{m_k}$ are independent,
the variance of the sample average equals the per-sample variance divided by $m_k$. Thus,

\[ \mathbb{E}\|e_k\|^2 = \mathbb{E}\|\nabla f(x_k) - \nabla F(x_k, Z_i)\|^2 / m_k \le \sup_x \mathbb{E}\|\nabla f(x) - \nabla F(x, Z)\|^2 / m_k \le O(1)\, \beta^k, \]

therefore satisfying Hypothesis 5.1.A and thus the hypothesis of Theorem 5.2.

Part 2 (Tail Bound). This follows from Hoeffding's Inequality; see Theorem A.5.
Thus we satisfy Hypothesis 5.1.B and therefore the hypothesis of Theorem 5.4. □



Theorem 6.2 (Sample average gradient convergence rates). Consider the 
stochastic- approximation algorithm described by (1.1) and (1.4) where 



for all k for some (3 < 1. Then the following hold. 



(6.1) 
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1. [Expectation bound] If the population variance is bounded, i.e. 

1 M 
sup jj— - J2 Wf( x ) - M X )W < oo, 

i—l 

then 

F,ir k -p k TT Q = 0([m&x{/3,p} + (] k ) for all C > 0. 

Up 7^ P, then the bound holds with C = 0. 

2. [Tail bound] If the population diameter is bounded, i.e., 



sup I max[V/j(x)]j - min[V/j («)]* [• < 



/or all i = 1, . . . ,n, the 



Pr (7Tfc - P^o > e) = O exp 



C 



/or some £ > 0. 



max{/3,p} 
5. [Deterministic bound] If the diameter of the error is bounded, i.e., 



sup||/,(x)|] 2 < oo 



for all i = 1, . . . , n, the 



7r fc -pV = O([max{/3,p} + C] fc ) for all ( > 0. 



If p ^= /3, then the bound holds with C = 0. 



Proof. 

Part 1 (Expectation Bound). Let 



1 M 



Then from Friedlander and Schmidt [5, §3.2], 



ElleJI 2 



A _ ™fc\ g(gfc) < L _ m k -l \ sup^ S(x) < ^ fc 
V M / m k ~ \ M J m k 



therefore satisfying Hypothesis 5.1.A and thus the hypothesis of Theorem 5.2.

Part 2 (Tail Bound). This follows from Serfling's Inequality; see Theorem A.6.
Thus we satisfy Hypothesis 5.1.B and therefore the hypothesis of Theorem 5.4.

Part 3 (Deterministic Bound). Refer to Friedlander and Schmidt [5, §3.1]. □

The asymptotic notation in the theorem statements simplifies the results;
nonasymptotic bounds, however, are available explicitly within the proofs. Figure 6.1
illustrates the nonasymptotic bounds (5.4) and (5.10) that correspond to parts 1 and 2
of Theorem 6.2; the deterministic bounds follow from Friedlander and Schmidt [5,
Theorem 3.1].
[Figure 6.1 omitted. Horizontal axis: number of passes through data.]

Fig. 6.1. An illustration of the bounds derived in Theorem 6.2; this figure plots the nonasymptotic
bound shown in (5.10). The thick black line (bottom left) shows the bound in expectation (see Part 1
of Theorem 6.2). For comparison, the thick red line (top right) shows the deterministic bound on
the distance to the solution (see [5, Theorem 3.1]). The thin lines in between give the bounds on
$\pi_k - \rho^k \pi_0$ that correspond to probabilities $10^{-i}$ for $i = 10, 20, \dots, 100$. Assume $M = 300$, $\beta = 0.9$,
and $\rho = 0.9$.



7. Numerical experiments. Figure 7.1 shows the results of a Monte Carlo
simulation on a logistic regression problem, where

$$f_i(x) = \log(1 + \exp[-b_i \langle a_i, x \rangle]),$$

$a_i \in \mathbb{R}^n$ is a vector of input features, and $b_i \in \{-1, 1\}$ is the corresponding observation.
For this problem, we generate a dataset with $M = 100$ pairs $(a_i, b_i)$ of random points.
Algorithm (1.1) and (1.4), where the sample size satisfies (6.1) with $\beta \approx 0.91$, is run
10,000 times on this fixed dataset. The starting point is the same for every run, and
the only difference between runs is the randomness of the sampling. Figure 7.1 summarizes the
results of this experiment. As expected, the sample paths are concentrated tightly
around the mean. Furthermore, the probability of deviating from the mean decays
doubly exponentially (cf. Theorem 6.2), as evidenced by the linear tail shown in the bottom
panel.
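
The experiment can be imitated on a small scale. The following Python sketch is only an illustration under stated assumptions and is not the code used to produce Figure 7.1: the data model (Gaussian features with noisy labels), the constant step size, the dimension n = 5, the 50 runs, and in particular the sample-size schedule `sample_size` (the smallest m with (1 - m/M)/m <= beta^k, used here as a stand-in for condition (6.1)) are all hypothetical choices.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def make_dataset(M, n, rng):
    """Assumed data model: Gaussian features, noisy labels from a hidden direction."""
    w = [1.0] * n
    A = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(M)]
    b = [1.0 if sum(a * c for a, c in zip(row, w)) + rng.gauss(0.0, 1.0) > 0 else -1.0
         for row in A]
    return A, b

def objective(x, A, b):
    """f(x) = (1/M) * sum_i log(1 + exp(-b_i <a_i, x>))."""
    total = 0.0
    for row, b_i in zip(A, b):
        t = -b_i * sum(a * c for a, c in zip(row, x))
        total += max(t, 0.0) + math.log1p(math.exp(-abs(t)))  # stable log(1 + e^t)
    return total / len(A)

def sample_size(k, M, beta):
    """Smallest m with (1 - m/M)/m <= beta**k (an assumed stand-in for (6.1))."""
    for m in range(1, M + 1):
        if (1.0 - m / M) / m <= beta ** k:
            return m
    return M

def run_once(A, b, x0, iters, beta, step, rng):
    """One sample path of x_{k+1} = x_k - step * (average gradient over a random batch)."""
    M, x = len(A), list(x0)
    for k in range(iters):
        batch = rng.sample(range(M), sample_size(k, M, beta))  # without replacement
        g = [0.0] * len(x)
        for i in batch:
            t = b[i] * sum(a * c for a, c in zip(A[i], x))
            coef = -b[i] * sigmoid(-t) / len(batch)
            for j, a in enumerate(A[i]):
                g[j] += coef * a
        x = [c - step * gj for c, gj in zip(x, g)]
    return objective(x, A, b)

rng = random.Random(0)
A, b = make_dataset(M=100, n=5, rng=rng)
x0 = [0.0] * 5
finals = sorted(run_once(A, b, x0, iters=100, beta=0.91, step=1.0, rng=rng)
                for _ in range(50))
print("f(x0) =", objective(x0, A, b))
print("final f: min, median, max =", finals[0], finals[len(finals) // 2], finals[-1])
```

Every run starts from the same point and uses the same dataset, so the spread between the minimum and maximum of `finals` reflects only the sampling randomness, which is the quantity the tail bounds control.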

A. Auxiliary results.

Lemma A.1. Suppose $f(x) = O(x^{O(1)}) \exp(-O(x^{O(1)}))$.
Then for all A > there exists a constant C^ that depends only on A such that 



fix) < C A e 



(A.l) 



,0(1)', 



Suppose f(x) = exp(0(ir )) exp(— exp(0(x^' {L> ))). Then for all A > there exists 
C \ such that 



f(x) < C A A~ 



(A.2) 



[Figure 7.1 omitted. Two panels, each with horizontal axis: iteration (k).]

Fig. 7.1. Top panel: distance to solution for quantiles 1 — 0.5 J and O-S"*, j = —5 : 5. Bottom 
panel: probability of the deviation from the expected value, plotted on a log-log y-axis; the tail
is doubly exponential.
Proof. The statement follows by taking logarithms on both sides of (A.1)
and (A.2). □

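Lemma A.2. For $y \in (0, 1)$ and $x \in [0, 1]$,

$$(1 - x)^{1 - 1/\log y} \le \prod_{i=0}^{\infty} (1 - x y^i). \tag{A.3}$$

Proof. To prove the lower bound, we use the following fact:

$$\ln(1 - x) \ge \frac{-x}{1 - x} \quad \text{for all } x \in [0, 1).$$

Therefore,

$$\begin{aligned}
\prod_{i=1}^{\infty} (1 - x y^i) &= \exp\Big( \sum_{i=1}^{\infty} \log(1 - x y^i) \Big) \\
&\ge \exp\Big( -\sum_{i=1}^{\infty} \frac{y^i}{1/x - y^i} \Big) \\
&\ge \exp\Big( -\int_0^{\infty} \frac{y^i}{1/x - y^i}\, di \Big) \\
&= \exp\Big( -\frac{\log(1 - x)}{\log(y)} \Big) \\
&= (1 - x)^{-1/\log y}.
\end{aligned}$$

Thus,

$$\prod_{i=0}^{\infty} (1 - x y^i) = (1 - x) \prod_{i=1}^{\infty} (1 - x y^i) \ge (1 - x)^{1 - 1/\log y},$$

as required. □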
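
As a sanity check, the inequality (A.3) can be verified numerically. The Python sketch below is an illustration only: the function names, the grid of test points, and the truncation of the infinite product after 5000 factors are assumptions made for this check and are not part of the proof.

```python
import math

def lower_bound(x, y):
    """Left-hand side of (A.3): (1 - x)^(1 - 1/log y)."""
    return (1.0 - x) ** (1.0 - 1.0 / math.log(y))

def infinite_product(x, y, terms=5000):
    """Truncated right-hand side of (A.3): prod_{i=0}^{terms-1} (1 - x * y^i)."""
    prod, power = 1.0, 1.0
    for _ in range(terms):
        prod *= 1.0 - x * power
        power *= y
    return prod

xs = [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 0.99]
ys = [0.1, 0.3, 0.5, 0.7, 0.9, 0.99]

# Collect any grid point where the claimed lower bound exceeds the product
# (a small relative tolerance absorbs floating-point rounding).
violations = [(x, y) for x in xs for y in ys
              if lower_bound(x, y) > infinite_product(x, y) * (1.0 + 1e-9)]

print("violations:", violations)
print("x = y = 0.5:", lower_bound(0.5, 0.5), "<=", infinite_product(0.5, 0.5))
```

For x = y = 0.5 the left-hand side equals 0.5/e, because 2^(1/log 2) = e, which gives a convenient closed-form value to compare against.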

Lemma A.3. For $y \in (0, 1)$ and $x \in [0, 1]$,

( log(l - s/y) - log(l - zy^" 1 " 1 ) \ n h i. 

Proof. Similar to the proof of the previous inequality 

N I N \ 

J|(l - a;y*) = exp I ^ log (l - xy 1 ) J 

i=l \i=l / 

ex p(E-7- 



AT i 

xy 



> - - > 

\fr( l-xy 

> exp — / 7 di 



/o 1 - xy 

( 1 
> exp 



log(l - x) - log ( 1 - zy^ 



log(y) 
N+1



]\(l-xy i )= H(l-(x/y)y*) 

log(l - x/y) - log(l - xy N+1 ) 



j=0 



> exp 



log(y) 



as required. □

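Lemma A.4. Let $k > 0$, $\nu > 0$, and $\epsilon > 0$. Then for $y \in (0, 1)$ and $x \in (0, 1]$,

$$\inf_{\theta > 0} \Big\{ \exp(-\theta \epsilon \nu) \prod_{i=0}^{N-1} (1 - \theta x y^i)^{-k} \Big\} \le \Big( \frac{\exp(1)}{\alpha} \cdot \frac{\nu \epsilon}{x} \Big)^{\alpha} \exp\Big( -\frac{\nu \epsilon}{x} \Big),$$

where $\alpha = k \Big[ \frac{1}{\log(1/y)} + 1 \Big]$.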



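Proof. By inverting both sides of (A.3) we obtain the following inequality:

$$\frac{1}{\prod_{i=0}^{\infty} (1 - x y^i)^k} \le \exp\Big( -k \Big[ \frac{1}{\log(1/y)} + 1 \Big] \log(1 - x) \Big). \tag{A.4}$$

Therefore, for $\epsilon > \alpha x / \nu$,

$$\begin{aligned}
\inf_{\theta > 0} \Big\{ \exp(-\theta \epsilon \nu) \prod_{i=0}^{N-1} (1 - \theta x y^i)^{-k} \Big\}
&\le \inf_{\theta > 0} \Big\{ \exp(-\theta \epsilon \nu) \prod_{i=0}^{\infty} (1 - \theta x y^i)^{-k} \Big\} \\
&\overset{(i)}{\le} \inf_{\theta > 0} \Big\{ \exp\Big( -k \Big[ \frac{1}{\log(1/y)} + 1 \Big] \log(1 - \theta x) - \theta \nu \epsilon \Big) \Big\} \\
&= \inf_{\theta > 0} \big\{ \exp( -\alpha \log(1 - \theta x) - \theta \epsilon \nu ) \big\} \\
&\overset{(ii)}{=} \exp\Big( -\alpha \log\Big( 1 - \Big( \frac{1}{x} - \frac{\alpha}{\nu \epsilon} \Big) x \Big) - \Big( \frac{1}{x} - \frac{\alpha}{\nu \epsilon} \Big) \nu \epsilon \Big) \\
&= \Big( \frac{\exp(1)}{\alpha} \cdot \frac{\nu \epsilon}{x} \Big)^{\alpha} \exp\Big( -\frac{\nu \epsilon}{x} \Big),
\end{aligned}$$

where (i) comes from (A.4); (ii) uses the substitution $\theta = 1/x - \alpha/(\nu \epsilon)$, which can be
shown to be the optimal choice of $\theta$. Because $\theta > 0$, $\epsilon > \alpha x / \nu$. □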
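
Step (ii) can be checked numerically. The following Python sketch is an illustration under assumed parameter values (alpha = 2.5, x = 0.3, nu = 1, eps = 2, chosen so that eps > alpha x / nu); the function names are hypothetical. It compares the value at theta = 1/x - alpha/(nu eps) with the closed-form expression and with a grid search over 0 < theta < 1/x.

```python
import math

def chernoff_objective(theta, alpha, x, nu, eps):
    """exp(-alpha * log(1 - theta*x) - theta*eps*nu), defined for 0 < theta < 1/x."""
    return math.exp(-alpha * math.log(1.0 - theta * x) - theta * eps * nu)

def closed_form(alpha, x, nu, eps):
    """(e/alpha * nu*eps/x)^alpha * exp(-nu*eps/x), the claimed minimum value."""
    t = nu * eps / x
    return (math.e * t / alpha) ** alpha * math.exp(-t)

# Assumed illustrative values satisfying eps > alpha * x / nu.
alpha, x, nu, eps = 2.5, 0.3, 1.0, 2.0

theta_star = 1.0 / x - alpha / (nu * eps)        # substitution used in step (ii)
value_at_star = chernoff_objective(theta_star, alpha, x, nu, eps)

# Grid search over the open interval (0, 1/x) as an independent check.
grid = [(j + 0.5) / 20000 / x for j in range(20000)]
grid_min = min(chernoff_objective(t, alpha, x, nu, eps) for t in grid)

print("theta* =", theta_star)
print("value at theta* =", value_at_star)
print("closed form     =", closed_form(alpha, x, nu, eps))
print("grid minimum    =", grid_min)
```

The three printed values should agree to within the grid resolution, which is consistent with the substitution being the minimizer when eps > alpha x / nu.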
For the following theorems, we define the sample average

$$S_m := \frac{1}{m} \sum_{i=1}^{m} X_i$$

for a sequence of random variables $\{X_1, \dots, X_m\}$.

Theorem A.5 (Hoeffding [6, Theorem 2]). Consider independent random variables
$\{X_1, \dots, X_m\}$, $X_i : \Omega \to \mathbb{R}$. If the random variables are bounded, i.e.,

$$d := \sup_{\omega} X_i(\omega) - \inf_{\omega} X_i(\omega)$$

is finite, then

$$\Pr(S_m - \mathbb{E} S_m \ge \epsilon) \le \exp(-\epsilon^2 / \eta_m) \quad \text{where} \quad \eta_m = \frac{d^2}{2m}.$$

Theorem A.6 (Serfling [14, Corollary 1.1]). Let $x_1, \dots, x_M$ be a population,
$\{X_1, \dots, X_m\}$ be samples drawn without replacement from the population, and let

$$d := \max_i x_i - \min_i x_i.$$

Then

$$\Pr(S_m - \mathbb{E} S_m \ge \epsilon) \le \exp(-\epsilon^2 / \eta_m) \quad \text{where} \quad \eta_m = \frac{d^2}{2m} \Big( 1 - \frac{m - 1}{M} \Big).$$

Because $\eta_m$ is strictly decreasing in $m$, the Serfling bound is uniformly better than
the Hoeffding bound. Note that the Serfling bound is not tight: in particular, when
$M = m$ (i.e., $S_m = \mathbb{E} S_m$), the bound is not zero (except for a degenerate population).
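
The two bounds can be compared directly. The Python sketch below is an illustration only; it assumes the standard forms of the two inequalities, namely eta_m = d^2/(2m) for Hoeffding and eta_m = (d^2/(2m))(1 - (m - 1)/M) for Serfling, and the values M = 300, d = 1, and eps = 0.1 are arbitrary choices for the example.

```python
import math

def eta_hoeffding(m, d):
    """Hoeffding scale: d^2 / (2m)."""
    return d * d / (2.0 * m)

def eta_serfling(m, M, d):
    """Serfling scale for sampling without replacement: d^2/(2m) * (1 - (m-1)/M)."""
    return d * d / (2.0 * m) * (1.0 - (m - 1) / M)

def tail_bound(eps, eta):
    """Upper bound exp(-eps^2 / eta) on Pr(S_m - E S_m >= eps)."""
    return math.exp(-eps * eps / eta)

M, d, eps = 300, 1.0, 0.1   # assumed illustrative values
for m in (1, 30, 150, 300):
    h = tail_bound(eps, eta_hoeffding(m, d))
    s = tail_bound(eps, eta_serfling(m, M, d))
    print(f"m = {m:3d}   Hoeffding = {h:.3e}   Serfling = {s:.3e}")
```

The two bounds coincide at m = 1, and the Serfling bound becomes markedly smaller as m approaches M, although it stays strictly positive at m = M, in line with the remark above.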